An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

Authors

  • Jörg Drechsler
  • Jerome P. Reiter
Abstract

When intense redaction is needed to protect data subjects’ confidentiality, statistical agencies can release synthetic data, in which identifying or sensitive values are replaced with draws from statistical models estimated from the confidential data. Specifying accurate synthesis models can be a difficult and labor-intensive task with standard parametric approaches. We describe and empirically evaluate four easy-to-implement, nonparametric synthesizers based on machine learning algorithms (classification and regression trees, bagging, random forests, and support vector machines) on their potential to preserve analytical validity and reduce disclosure risks. The results suggest that synthesizers based on regression trees can provide high utility with low disclosure risks.
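The abstract does not give implementation details, so the following is only a minimal sketch of how a regression-tree synthesizer could work, not the authors' procedure. It assumes numeric predictors, a single sensitive variable, and the simplest possible draw: each record's sensitive value is replaced by a value resampled from the observed values in its tree leaf. The function names (`grow_tree`, `leaf_donors`, `synthesize`) and the `min_leaf` parameter are hypothetical, chosen for illustration.

```python
import random


def sum_sq(values):
    """Sum of squared deviations from the mean (the node impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, ys, min_leaf):
    """Return (sse, feature, threshold) for the best binary split, or None."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, ys) if r[j] <= t]
            right = [y for r, y in zip(rows, ys) if r[j] > t]
            # Assumes min_leaf >= 1, so neither side is ever empty.
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best


def grow_tree(rows, ys, min_leaf=5):
    """Recursively grow a regression tree; leaves keep their observed values."""
    split = best_split(rows, ys, min_leaf)
    if split is None or split[0] >= sum_sq(ys):
        return {"donors": list(ys)}  # leaf: the pool of values to draw from
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([r for r, _ in left], [y for _, y in left], min_leaf),
        "right": grow_tree([r for r, _ in right], [y for _, y in right], min_leaf),
    }


def leaf_donors(tree, row):
    """Drop a record down the tree and return the donor pool of its leaf."""
    while "donors" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["donors"]


def synthesize(rows, ys, min_leaf=5, seed=None):
    """Replace each sensitive value with a random draw from its leaf's donors."""
    rng = random.Random(seed)
    tree = grow_tree(rows, ys, min_leaf)
    return [rng.choice(leaf_donors(tree, r)) for r in rows]


# Toy usage: one predictor, a sensitive outcome with two clear groups.
rows = [(x,) for x in range(1, 11)]
ys = [10, 11, 12, 13, 14, 100, 101, 102, 103, 104]
synthetic = synthesize(rows, ys, min_leaf=3, seed=1)
```

In this sketch `min_leaf` is the main lever between utility and risk: larger leaves mean each synthetic value is drawn from more donors, which plausibly lowers disclosure risk but blurs relationships in the data. Plain resampling can also release an observed value unchanged, so a practical synthesizer would likely add some smoothing on top of the leaf draw.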


Related resources

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...


Synthetic Datasets for the German IAB Establishment Panel

Disseminating microdata to the public that provide a high level of data utility while at the same time guaranteeing the confidentiality of the survey respondent is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential of enabling the data disseminating agency to achieve this twofold goal. So far, the appr...


On the Generation of Spatiotemporal Datasets

An efficient benchmarking environment for spatiotemporal access methods should at least include modules for: generating synthetic datasets, storing datasets (real datasets included), collecting and running access structures, and visualizing experimental results. Focusing on the dataset repository module, a collection of synthetic data that would simulate a variety of real life scenarios is requ...


Nonparametric Estimation of Multi-View Latent Variable Models

Spectral methods have greatly advanced the estimation of latent variable models, generating a sequence of novel and efficient algorithms with strong theoretical guarantees. However, current spectral algorithms are largely restricted to mixtures of discrete or Gaussian distributions. In this paper, we propose a kernel method for learning multi-view latent variable models, allowing each mixture c...


Computational aspects of nonparametric smoothing with illustrations from the sm library

Smoothing techniques such as density estimation and nonparametric regression are widely used in applied work and the basic estimation procedures can be implemented relatively easily in standard statistical computing environments. However, computationally efficient procedures quickly become necessary with large datasets, many evaluation points or more than one covariate. Further computational issu...



Journal:
  • Computational Statistics & Data Analysis

Volume 55

Publication date: 2011